This project aims to provide a predictive model of home prices for Davidson County, Tennessee. Being able to predict home prices and to identify the features that add to or diminish a home’s value is critical not only to homeowners and buyers but also to the housing market as a whole. The existing predictive models that Zillow built for the County seem to lack consideration of local factors, which may reduce the accuracy of some predictions. Our goal is therefore to build a reliable model that takes local intelligence into account, helping Zillow provide robust data to its clients.
The data wrangling and feature engineering processes were the most challenging parts of this project. The limited open data sources available for Davidson County made data collection difficult, and it was also hard to judge whether a given data source was reliable and trustworthy. Finding the features that best explain home prices and their spatial patterns is the key to building a robust and effective model. Building new features from datasets obtained online requires not only background knowledge of the local housing market but also a great deal of trial and error.
The overall modeling strategy is based on the statistical tools we learned in class. Using the dataset provided by the instructor and data gathered from online sources, we built a multiple linear regression model of the observed sale prices and trained it on the training dataset. We then used the model to predict the unknown sale prices. We assess the model’s performance by examining the errors between predicted and observed sale prices.
Besides the dataset given by the instructor, we obtained additional data primarily from the Nashville Open Data Portal. Census data, such as the number of individuals living in poverty and the number of individuals with a bachelor’s degree, were downloaded from the Census API through the “tidycensus” package. We also gathered data on grocery stores, retail stores, universities, and clinics from the Google API. In the feature engineering process, we used our domain knowledge to create new features relevant to the dependent variable we are predicting, Sale Price.
SalePrice — Davidson County Home Price.
LocationCity — The City name.
LocationZip — Zip code.
CensusBlock — Census block number.
LandUseFullDescription — Land use.
neighjud — Dummy variable: inside or outside the neighborhood district.
num_vacant_unit — Number of vacant units.
WhiteAlone — Number of individuals who identify as white alone.
Poverty — Individuals below 100% poverty level.
Unemploy — Unemployed individuals.
BachDegree — Individuals with a bachelor’s degree.
Parks — Dummy variable: inside or outside a 0.25-mile buffer around a park.
d_prisecroads — Distance to primary and secondary roads.
d_Retail — Distance to retail stores.
d_clinics — Distance to clinics.
d_grocery — Distance to grocery stores.
d_crime — Distance to reported aggravated crimes.
age — Year built.
nasimp — First two numbers of the neighborhood code used by the Assessor’s office to group similar properties for the purpose of determining property value.
Acrage — Acres of land.
Story_Height — Number of stories of the building.
Exterior_Wall — Exterior wall type.
Frame — Building frame type.
units_building — Number of units in a multi-family building, such as a duplex.
sf_finished — Square feet of finished area.
sf_bsmt — Square feet of the basement if any.
ac_sfyi — Central air. 0 = no central air; 1 = central air (Residential).
Phys_Depreciation — Building condition.
NumofUnits_land — Units used for appraisal.
Zone_Assessor — Zones (jurisdictions). 9 large areas of the county used by the appraisal staff to coordinate appraisal teams.
Land_Unit_Type — Type of units.
baths — Number of baths.
Fixtures — Estimated plumbing fixtures.
Foundation — Type of foundation.
AveSale2 — Average sale price of nearest two neighbors.
AveSale5 — Average sale price of nearest five neighbors.
AveSale10 — Average sale price of nearest ten neighbors.
Building Characteristics
===============================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
-------------------------------------------------------------------------------
housing_units 8,415 2,061.680 707.295 450 1,565 2,582 4,911
num_vacant_unit 8,415 188.027 144.506 8 92 247 1,580
SalePrice 8,415 312,147.700 307,978.500 2,000 150,000 376,840 6,894,305
Acrage 8,415 0.209 0.321 0.000 0.000 0.270 8.160
sf_finished 8,415 1,843.708 884.946 348 1,238 2,206 10,608
sf_bsmt 8,415 185.892 450.246 0 0 0 3,531
NumofUnits_land 8,415 75.530 1,825.649 0 1 1 116,741
Fixtures 8,415 9.623 3.658 3 7 12 38
-------------------------------------------------------------------------------
Spatial Structure
===============================================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
-----------------------------------------------------------------------------------------------
CensusBlock 8,415 37,015,416.000 2,847.677 37,010,105 37,012,801 37,017,902 37,019,600
Parks 8,415 0.135 0.342 0 0 0 1
d_Retail 8,415 0.036 0.024 0.001 0.020 0.045 0.132
d_clinics 8,415 0.035 0.029 0.0005 0.015 0.043 0.133
d_grocery 8,415 0.020 0.012 0.002 0.011 0.026 0.056
d_crime 8,415 0.002 0.001 0.00001 0.001 0.003 0.008
Zone_Assessor 8,415 4.231 2.655 1 2 6 9
AveSale2 8,415 311,958.600 271,086.300 10,000 157,000 374,000 4,706,372
AveSale5 8,415 313,645.100 251,958.400 22,700 162,500 384,307.9 4,534,583
AveSale10 8,415 316,288.100 243,272.500 37,900.000 164,540.000 388,990.000 3,597,141.000
-----------------------------------------------------------------------------------------------
Census Tract
================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
----------------------------------------------------------------
WhiteAlone 8,415 2,843.831 1,293.914 89 2,119 3,731 6,815
Poverty 8,415 756.843 497.803 26 345 1,024 2,623
Unemploy 8,415 145.165 116.541 0 64 183 643
BachDegree 8,415 792.530 437.939 45 438 1,154 2,050
----------------------------------------------------------------
The map below shows a cluster of crime. In the central district, the distance to aggravated crimes is smaller than in the surrounding regions. The map implies that the southeastern region is relatively safer than other parts of the study area, either because reported crimes are farther from each house or because fewer aggravated crimes are reported there.
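The report does not show how the crime-distance feature was computed. As a minimal sketch, assuming it is the mean straight-line distance from each house to its k nearest reported crimes, it could look like the following. The function name `mean_knn_distance` and the toy coordinates are hypothetical, and real data would need projected coordinates rather than raw longitude and latitude.

```r
# Mean Euclidean distance from each row of from_xy to its k nearest rows of to_xy.
# Both inputs are two-column matrices of (x, y) coordinates in the same units.
mean_knn_distance <- function(from_xy, to_xy, k = 5) {
  stopifnot(ncol(from_xy) == 2, ncol(to_xy) == 2, nrow(to_xy) >= k)
  apply(from_xy, 1, function(p) {
    d <- sqrt((to_xy[, 1] - p[1])^2 + (to_xy[, 2] - p[2])^2)
    mean(sort(d)[seq_len(k)])  # average over the k smallest distances
  })
}

# Hypothetical toy data: two houses and six crime locations
houses <- cbind(x = c(0, 10), y = c(0, 10))
crimes <- cbind(x = c(1, 0, -1, 0, 2, 10), y = c(0, 1, 0, -1, 0, 0))

d_crime_toy <- mean_knn_distance(houses, crimes, k = 5)
print(d_crime_toy)
```

The first house sits inside the crime cluster and gets a small value, while the second house is far from it and gets a large one, which is the contrast the map describes.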
Besides the clustering within each census tract, the map below indicates that neighborhoods in the southwestern part of the County are wealthier, or at least have fewer residents living below the poverty level. Neighborhoods in the northwest of the County have a higher number of individuals living in poverty. This pattern partly mirrors the Distance to Nearest 5 Reported Aggravated Crime Map: neighborhoods with less exposure to aggravated crime also have fewer residents in poverty.
The average sale price of nearest neighbors is one of the most powerful predictors for reducing spatial autocorrelation in the residuals, because it explains the spatial pattern of sale prices particularly well. Comparing the map with the Davidson County Home Prices Map above shows that it reflects the clustering of high and low sale prices.
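To illustrate the idea behind AveSale2, AveSale5, and AveSale10, here is a minimal base R sketch of a nearest-neighbor average price. It assumes Euclidean distance on projected coordinates and excludes each home from its own neighbor set. The function name `knn_mean_price` and the toy data are hypothetical, and the authors’ actual implementation is not shown in the report.

```r
# Average sale price of the k nearest neighbors of each home.
# xy: two-column matrix of coordinates; price: sale prices in the same row order.
knn_mean_price <- function(xy, price, k) {
  n <- nrow(xy)
  stopifnot(length(price) == n, k < n)
  d <- as.matrix(dist(xy))  # pairwise Euclidean distances
  diag(d) <- Inf            # a home is never its own neighbor
  vapply(seq_len(n), function(i) {
    nearest <- order(d[i, ])[seq_len(k)]
    mean(price[nearest])
  }, numeric(1))
}

# Hypothetical toy data: four homes along a line
xy <- cbind(x = c(0, 1, 2, 10), y = c(0, 0, 0, 0))
price <- c(100, 200, 300, 900)

ave_sale_2 <- knn_mean_price(xy, price, k = 2)
print(ave_sale_2)
```

When predicting homes whose prices are unknown, the neighbor prices would presumably have to come only from homes with observed sales; otherwise the feature would leak the value being predicted.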
After collecting and cleaning the data, we performed an OLS regression to explore the relationship between observed sale prices and the explanatory variables (i.e., predictors), and to estimate the unknown sale prices. We used the known sale prices and the selected predictors to train the regression model on the training dataset. Because OLS minimizes the squared errors between predicted and observed sale prices, the fitted model can then be used to estimate the missing sale prices in the test dataset. We assess the model’s performance with R-squared, mean absolute percentage error (MAPE), and mean absolute error (MAE).
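The workflow described here can be sketched on simulated data as follows. The data, the 75/25 split, and the two predictors are hypothetical stand-ins for the real dataset, and computing R-squared as the squared correlation between observed and predicted prices is one common convention that the report does not confirm.

```r
set.seed(1)

# Hypothetical toy data standing in for the real housing dataset
n <- 200
toy <- data.frame(sf_finished = runif(n, 500, 4000),
                  Acrage = runif(n, 0, 2))
toy$SalePrice <- exp(11 + 0.0004 * toy$sf_finished + 0.1 * toy$Acrage +
                       rnorm(n, sd = 0.2))

# Split into training and test sets (assumed 75/25)
train_idx <- sample(n, size = 0.75 * n)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Fit OLS on log price, then back-transform predictions to dollars
fit <- lm(log(SalePrice) ~ sf_finished + Acrage, data = train)
pred <- exp(predict(fit, newdata = test))

# Error metrics on the test set
mae <- mean(abs(test$SalePrice - pred))
mape <- mean(abs(test$SalePrice - pred) / test$SalePrice)
r2 <- cor(test$SalePrice, pred)^2  # assumed definition of test-set R-squared

print(c(R2 = r2, MAE = mae, MAPE = mape))
```

MAE is in dollars and so grows with the price level, whereas MAPE is relative to each home’s price, which is why the two can rank the same predictions differently.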
This section includes visualizations of the prediction results and the model fit test results.
=========================================================================
Dependent variable:
---------------------------
log(SalePrice)
-------------------------------------------------------------------------
CensusBlock 0.00001**
t = 2.292
neighjud1 0.048***
t = 4.032
housing_units 0.00005*
t = 1.766
num_vacant_unit 0.00005
t = 0.711
WhiteAlone 0.00003**
t = 2.308
Poverty -0.0001***
t = -5.338
Unemploy 0.0004***
t = 4.877
BachDegree -0.0001***
t = -2.744
Parks -0.053***
t = -3.545
d_Retail -1.744**
t = -2.376
d_clinics -0.885
t = -1.426
d_grocery 2.294**
t = 2.112
d_crime 8.324
t = 1.564
LocationCityBRENTWOOD 0.240***
t = 4.858
LocationCityMADISON 0.095
t = 1.012
LocationCityNASHVILLE 0.134*
t = 1.876
LocationCityWHITES CREEK -0.105
t = -0.374
LocationZip37027
LocationZip37115
LocationZip37189
LocationZip37201 0.403***
t = 3.867
LocationZip37203 0.412***
t = 6.109
LocationZip37204 0.323***
t = 4.595
LocationZip37205 0.352***
t = 5.174
LocationZip37206 0.411***
t = 4.939
LocationZip37207 -0.075
t = -0.905
LocationZip37208 0.225***
t = 3.118
LocationZip37209 0.369***
t = 5.487
LocationZip37210 -0.123
t = -1.612
LocationZip37211 -0.089
t = -1.320
LocationZip37212 0.305***
t = 4.095
LocationZip37214 -0.067
t = -0.811
LocationZip37215 0.311***
t = 4.480
LocationZip37216 0.394***
t = 4.776
LocationZip37217 -0.237***
t = -3.125
LocationZip37218 0.094
t = 0.885
LocationZip37219 0.392**
t = 2.167
LocationZip37220 0.230***
t = 2.739
LocationZip37221
LandUseFullDescriptionRESIDENTIAL COMBO/MISC -0.174
t = -0.342
LandUseFullDescriptionRESIDENTIAL CONDO 0.128
t = 0.539
LandUseFullDescriptionSINGLE FAMILY 0.130
t = 0.548
LandUseFullDescriptionVACANT RESIDENTIAL LAND 0.551*
t = 1.884
LandUseFullDescriptionZERO LOT LINE -0.137
t = -0.568
nasimp2 -0.487***
t = -6.204
nasimp3 -0.825***
t = -5.526
nasimp7 -1.153**
t = -2.542
nasimp10 -0.089
t = -1.086
nasimp11 -0.436***
t = -5.557
nasimp12 -0.202***
t = -2.814
nasimp13 0.278*
t = 1.666
nasimp14 0.193*
t = 1.755
nasimp16 -0.061
t = -0.827
nasimp17 -0.440
t = -1.038
nasimp19 -0.269***
t = -2.593
nasimp20 0.249***
t = 2.917
nasimp21 -0.146*
t = -1.742
nasimp22 -0.025
t = -0.304
nasimp23 -0.118
t = -1.455
nasimp24 -0.132
t = -1.503
nasimp25 -0.094
t = -1.172
nasimp26 -0.188**
t = -2.304
nasimp27 -0.185**
t = -2.132
nasimp30 -0.216***
t = -2.640
nasimp31 -0.112
t = -1.615
nasimp32 -0.163**
t = -2.315
nasimp33 -0.375***
t = -5.664
nasimp34 0.036
t = 0.423
nasimp35 -0.461***
t = -5.223
nasimp36 -0.263***
t = -3.413
nasimp37 -0.088
t = -1.254
nasimp38 -0.035
t = -0.482
nasimp39 -0.226***
t = -3.258
nasimp40 -0.047
t = -0.710
nasimp41 -0.147**
t = -2.155
nasimp42 -0.064
t = -1.032
nasimp43 -0.180***
t = -2.720
nasimp44 -0.295***
t = -3.939
nasimp48 -0.329***
t = -4.284
nasimp49 -0.112
t = -0.692
nasimp60 0.012
t = 0.174
nasimp62 0.148*
t = 1.922
nasimp63 -0.129**
t = -2.121
nasimp64 -0.242***
t = -3.417
nasimp67 -0.173**
t = -2.381
nasimp69 -0.314***
t = -3.019
nasimp72 -0.346
t = -0.857
nasimp73 -0.335***
t = -4.097
nasimp92 -0.154
t = -0.765
nasimp93 -0.127
t = -0.427
Acrage 0.124***
t = 5.586
Story_Height1.25 STORY 0.069
t = 1.143
Story_Height1.5 STORY 0.013
t = 0.540
Story_Height1.75 STORY 0.023
t = 0.897
Story_Height2 STORY -0.018
t = -1.138
Story_Height2.25 STORY -0.134
t = -1.573
Story_Height2.5 STORY -0.004
t = -0.044
Story_Height2.75 STORY -0.230**
t = -2.208
Story_Height3 STORY 0.008
t = 0.242
Story_Height4 STORY -0.294
t = -1.240
Story_HeightBI-LEVEL 0.062
t = 1.107
Story_HeightCOM 3 STY 0.026
t = 0.064
Story_HeightCOM 4 STY 0.555
t = 1.360
Story_HeightSPLIT LEVEL 0.073
t = 1.602
Exterior_WallBRICK/FRAME -0.059***
t = -3.747
Exterior_WallCONC BLK 0.009
t = 0.100
Exterior_WallFRAME -0.091***
t = -6.466
Exterior_WallFRAME/STONE -0.177**
t = -2.091
Exterior_WallLOG -0.183
t = -0.432
Exterior_WallMETAL 0.159
t = 1.025
Exterior_WallSTONE 0.011
t = 0.227
Exterior_WallSTUCCO -0.125***
t = -3.260
FrameRESD FRAME 0.232***
t = 6.203
FrameRESD FRAME 0.172***
t = 7.140
FrameTYPICAL 0.188***
t = 8.998
age1830 1.634***
t = 2.828
age1860 1.994***
t = 3.450
age1890 1.327***
t = 3.155
age1900 1.364***
t = 3.267
age1910 1.294***
t = 3.147
age1920 1.291***
t = 3.170
age1930 1.324***
t = 3.252
age1940 1.295***
t = 3.184
age1950 1.267***
t = 3.118
age1960 1.214***
t = 2.987
age1970 1.170***
t = 2.877
age1980 1.271***
t = 3.126
age1990 1.331***
t = 3.272
age2000 1.398***
t = 3.438
age2010 1.483***
t = 3.645
units_building1 -0.003
t = -0.030
units_building2 0.012
t = 0.045
units_building4 -0.460
t = -1.085
units_building11 0.379
t = 0.899
sf_finished 0.0001***
t = 11.280
sf_bsmt -0.00000
t = -0.165
ac_sfyi1 -0.054
t = -1.492
Phys_DepreciationDilapidated -0.433**
t = -2.129
Phys_DepreciationExcellent 0.195
t = 0.675
Phys_DepreciationFair -0.144***
t = -3.640
Phys_DepreciationGood 0.134**
t = 2.178
Phys_DepreciationPoor -0.293***
t = -3.618
Phys_DepreciationVery Good 0.345**
t = 2.026
Phys_DepreciationVery Poor -0.337**
t = -2.474
NumofUnits_land 0.00000
t = 1.505
Zone_Assessor -0.007
t = -1.313
Land_Unit_TypeN NASHVILLE RPDLND 0.939*
t = 1.917
Land_Unit_TypeOH MAD RG RPDLND 0.410
t = 0.820
Land_Unit_TypePRIME SF 0.288
t = 0.609
Land_Unit_TypeR PRIME AC 0.470
t = 0.995
Land_Unit_TypeR PRIME SF 0.540
t = 1.281
Land_Unit_TypeR RESID`L SF 0.564
t = 1.235
Land_Unit_TypeR SITE VAL 0.476
t = 1.169
Land_Unit_TypeR SITE VAL RESD SITE VALUE 0.442
t = 1.083
Land_Unit_TypeR UNDVL SF 0.932**
t = 1.964
Land_Unit_TypeRPDLND 0.265
t = 0.459
Land_Unit_Types 0.248
t = 0.430
Land_Unit_TypeVNDY HBVLG RPDLND 0.538
t = 0.926
baths1 -0.109
t = -0.379
baths2 0.027
t = 0.092
baths3 0.008
t = 0.027
baths4 -0.133
t = -0.456
baths5 -0.315
t = -1.062
baths6 -0.308
t = -0.969
baths7 -0.606*
t = -1.729
baths8 -3.045***
t = -5.947
Fixtures -0.001
t = -0.274
FoundationCRAWL -0.046
t = -0.112
FoundationFULL BASEMENT -0.015
t = -0.035
FoundationPART BASEMENT -0.025
t = -0.060
FoundationPIERS 0.210
t = 0.482
FoundationSLAB -0.088
t = -0.217
FoundationTYPICAL -0.041
t = -0.071
AveSale2 0.00000***
t = 35.396
AveSale5 0.00000***
t = 8.099
AveSale10 -0.00000***
t = -11.936
Constant -428.107**
t = -2.241
-------------------------------------------------------------------------
Observations 8,415
R2 0.710
Adjusted R2 0.704
Residual Std. Error 0.404 (df = 8242)
F Statistic 117.351*** (df = 172; 8242)
=========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
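Because the dependent variable is log(SalePrice), a coefficient on a dummy variable is not a dollar amount. Under the usual log-linear reading, a coefficient b corresponds to an approximate percentage change in price of (exp(b) - 1) × 100, holding the other predictors fixed. The short sketch below applies that conversion to three coefficients copied from the table; the helper name `pct_effect` is hypothetical.

```r
# Convert log-linear coefficients into approximate percentage changes in price
pct_effect <- function(beta) (exp(beta) - 1) * 100

# Coefficients copied from the regression table above
coefs <- c(neighjud1 = 0.048,
           LocationCityBRENTWOOD = 0.240,
           Parks = -0.053)

pct <- round(pct_effect(coefs), 1)
print(pct)
```

Read this way, being inside the neighborhood district is associated with a price roughly 5% higher, and being within the park buffer with a price roughly 5% lower, all else equal.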
| Index | R Squared | MAE | MAPE |
|---|---|---|---|
| Value | 0.723 | 106279.974 | 0.351 |
The mean R squared of the cross validation result is: 0.788
The standard deviation of R squared of the cross validation result is: 0.129
According to the histogram of R squared, the model does not appear to be overfitting.
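The mean and standard deviation of R squared reported above come from cross-validation. A minimal base R sketch of k-fold cross-validation is shown below. The number of folds (k = 10), the toy data, and the squared-correlation definition of R squared are all assumptions, since the report does not state them.

```r
set.seed(42)

# Hypothetical toy data
n <- 300
toy <- data.frame(x = runif(n))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = 0.5)

# Randomly assign each row to one of k folds
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, score on the held-out fold
cv_r2 <- vapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  cor(toy$y[fold == i], pred)^2
}, numeric(1))

print(c(mean = mean(cv_r2), sd = sd(cv_r2)))
# hist(cv_r2) would draw the kind of histogram referred to above
```

A tight histogram centered near the in-sample R squared suggests the model generalizes across folds, while a wide spread suggests that performance depends heavily on which observations are held out.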
From a visual inspection of the Regression Residuals Map, the residuals appear randomly distributed throughout the study area. The p-value of the Moran’s I test likewise suggests that the test-set residuals are randomly distributed across the map, meaning that the model accounts for spatial autocorrelation reasonably well.
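For readers unfamiliar with the statistic, Moran’s I for values x with spatial weights w is (n / sum of all weights) × (sum over pairs of w_ij z_i z_j) / (sum of z_i squared), where z holds the deviations of x from its mean. The sketch below implements that formula in base R with a simple one-sided permutation test. The function names, the toy weights matrix, and the permutation approach are illustrative assumptions, not the authors’ code.

```r
# Moran's I for values x and an n-by-n spatial weights matrix w
morans_i <- function(x, w) {
  stopifnot(nrow(w) == length(x), ncol(w) == length(x))
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# One-sided permutation p-value: how often does shuffling the values
# over locations give an I at least as large as the observed one?
moran_perm_p <- function(x, w, nsim = 999) {
  obs <- morans_i(x, w)
  sims <- replicate(nsim, morans_i(sample(x), w))
  (sum(sims >= obs) + 1) / (nsim + 1)
}

# Hypothetical toy example: four locations on a line, neighbors are adjacent
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w <- w + t(w)               # symmetric binary adjacency
resid_toy <- c(1, 2, 3, 4)  # smoothly increasing values, so I is positive

i_toy <- morans_i(resid_toy, w)
print(i_toy)
```

A large p-value means the null hypothesis of spatial randomness cannot be rejected, which is the outcome the paragraph above describes for the test-set residuals.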
According to the prediction result map above, higher home prices are expected in the southwest of the study area; the northern and southeastern parts of the study region have lower predicted home prices.
The MAPE by Zip for the Test Set map shows that the model performs better in the southern part of the study region (yellow) and worse in the northern and southeastern regions (dark blue), which have relatively lower predicted home prices.
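A per-zip MAPE like the one mapped above can be obtained by averaging the absolute percentage error within each zip code. The sketch below uses a small hypothetical data frame; the column names `predicted` and `ape` are assumptions.

```r
# Hypothetical test-set rows with observed and predicted prices
test_toy <- data.frame(
  LocationZip = c("37201", "37201", "37207", "37207"),
  SalePrice = c(400000, 500000, 100000, 200000),
  predicted = c(440000, 450000, 150000, 160000)
)

# Absolute percentage error per home, then the mean within each zip code
test_toy$ape <- abs(test_toy$SalePrice - test_toy$predicted) / test_toy$SalePrice
mape_by_zip <- aggregate(ape ~ LocationZip, data = test_toy, FUN = mean)
print(mape_by_zip)
```

Because the error is divided by the observed price, the same dollar error weighs more heavily in zip codes with cheaper homes.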
In this section, we attempt a spatial cross-validation to measure the generalizability of the model across neighborhoods with different income levels.
library(dplyr)  # provides %>%, filter(), select(), and mutate()
#Wealthy Neighborhood
richnum <- 5
richset <- rall_1 %>% filter(neigh == richnum)
withoutrichset <- rall_1 %>% filter(neigh != richnum)
regwithoutrich <- lm(log(SalePrice) ~ .,
                     data = withoutrichset %>% select(-kenID, -test, -longitude, -latitude, -neigh))
richval <- predict(regwithoutrich, richset)
richbind <- cbind('obspr' = richset$SalePrice, 'logpr' = richval) %>%
  as.data.frame() %>%
  mutate(predpr = exp(richval),  # back-transform from the log scale
         richness = 'Rich')
mapeofrich <- mean(abs(richbind$obspr - richbind$predpr) / richbind$obspr)
maeofrich <- mean(abs(richbind$obspr - richbind$predpr))
#Middle-Income Neighborhoods
mediannum <- 46
medset <- rall_1 %>% filter(neigh == mediannum)
withoutmedset <- rall_1 %>% filter(neigh != mediannum)
regwithoutmed <- lm(log(SalePrice) ~ .,
                    data = withoutmedset %>% select(-kenID, -test, -longitude, -latitude, -neigh))
medval <- predict(regwithoutmed, medset)
medbind <- cbind('obspr' = medset$SalePrice, 'logpr' = medval) %>%
  as.data.frame() %>%
  mutate(predpr = exp(medval),  # back-transform from the log scale
         richness = 'Middle-income')
mapeofmed <- mean(abs(medbind$obspr - medbind$predpr) / medbind$obspr)
maeofmed <- mean(abs(medbind$obspr - medbind$predpr))
#Low-Income Neighborhoods
poornum <- 266
#rank no.161 out of 181
poorset <- rall_1 %>% filter(neigh == poornum)
withoutpoorset <- rall_1 %>% filter(neigh != poornum)
regwithoutpoor <- lm(log(SalePrice) ~ .,
                     data = withoutpoorset %>% select(-kenID, -test, -longitude, -latitude, -neigh))
poorval <- predict(regwithoutpoor, poorset)
poorbind <- cbind('obspr' = poorset$SalePrice, 'logpr' = poorval) %>%
  as.data.frame() %>%
  mutate(predpr = exp(poorval),  # back-transform from the log scale
         richness = 'Poor')
mapeofpoor <- mean(abs(poorbind$obspr - poorbind$predpr) / poorbind$obspr)
maeofpoor <- mean(abs(poorbind$obspr - poorbind$predpr))
| Richness | MAPE | MAE |
|---|---|---|
| Rich | 0.202 | 116114 |
| Middle | 0.438 | 74581 |
| Poor | 0.356 | 32493 |
The results indicate that, by MAPE, the model performs best in the wealthy neighborhood (0.202), moderately in the poor neighborhood (0.356), and worst in the middle-income neighborhood (0.438). MAE follows the opposite order because it scales with the price level. However, each result is influenced by the number of observations in the training and held-out sets. Overall, the model generalizes less well to lower-income neighborhoods than to wealthy ones.
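The three hold-out runs follow the same pattern, so they could be wrapped in one helper that also reports how many homes were held out, which makes the sample-size caveat visible. This is a hedged sketch on simulated data, not the authors’ code: the function `holdout_neighborhood`, its arguments, and the toy data frame are hypothetical, and only the neighborhood IDs 5, 46, and 266 are taken from the report.

```r
# Fit on all neighborhoods except one, then score predictions on the held-out one
holdout_neighborhood <- function(data, group_col, group_id, model_formula) {
  held_out <- data[[group_col]] == group_id
  fit <- lm(model_formula, data = data[!held_out, ])
  pred <- exp(predict(fit, newdata = data[held_out, ]))  # back to dollars
  obs <- data$SalePrice[held_out]
  data.frame(neigh = group_id,
             n = sum(held_out),
             MAPE = mean(abs(obs - pred) / obs),
             MAE = mean(abs(obs - pred)))
}

# Hypothetical toy data with three neighborhoods of 40 homes each
set.seed(7)
toy <- data.frame(neigh = rep(c(5, 46, 266), each = 40),
                  sf_finished = runif(120, 800, 3500))
toy$SalePrice <- exp(11.5 + 0.0004 * toy$sf_finished + rnorm(120, sd = 0.25))

results <- do.call(rbind, lapply(c(5, 46, 266), function(id) {
  holdout_neighborhood(toy, "neigh", id, log(SalePrice) ~ sf_finished)
}))
print(results)
```

With categorical predictors, `predict()` stops with an error if the held-out neighborhood contains a factor level that never appears in the training rows, so such levels would need to be handled before running this on the full variable set.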
Our predictive model performs moderately well according to the results. The most interesting variables include the average distance to aggravated crimes, the number of individuals living in poverty by tract, and the average sale prices of the nearest 2, 5, and 10 neighbors. The first two variables explain variation in home prices in terms of external influences and socioeconomic characteristics. The latter three, derived directly from the dependent variable, are highly significant and explain the spatial pattern of home prices well. According to the regression results table, the variables nasimp and age are also strongly associated with sale price.
The model has a mean cross-validated R-squared of 0.788, meaning that about 79% of the variation in sale prices is explained on average across folds; on the test set the R-squared is 0.723. The test-set mean absolute percentage error (MAPE) is 0.351, meaning that predictions are off by about 35% on average. From the regression residual map, the residuals appear randomly distributed throughout the study area, and the Moran’s I test supports the observation that the model handles spatial autocorrelation well. The model predicts particularly well in the southern part of the County but performs poorly in the southeastern region. One possible reason is that the southeast has other contributing factors we have not taken into account; variables such as median household income might explain the variation in buying power among neighborhoods.
We would not yet recommend our model to Zillow, because its performance is not strong enough and could give clients wrong expectations and cause frustration. We could continue to improve its accuracy and generalizability by deepening our domain knowledge and including more variables associated with home prices.